In [ ]:
%%HTML
<style>
.container { width:100% }
</style>
In this notebook we need both `sklearn` and `pandas`. These can be installed using the following commands:
conda install scikit-learn
conda install pandas
We import the module `pandas`. This module implements so-called data frames and is more convenient than the module `csv` for reading a CSV file.
In [ ]:
import pandas as pd
The data we want to read is contained in the CSV file `'cars.csv'`.
In [ ]:
cars = pd.read_csv('cars.csv')
cars.head()
We want to convert the column containing `mpg` into a NumPy array, while the remaining numerical attributes should be collected into a feature matrix.
In [ ]:
import numpy as np
In [ ]:
X = np.array(cars[['cyl', 'displacement', 'hp', 'weight', 'acc', 'year']])
Y = np.array(cars['mpg'])
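To make the shapes of the two arrays concrete, here is a minimal sketch that uses a tiny hand-made data frame in place of `cars.csv`. All values are made up, and the `name` column is a hypothetical non-numerical attribute that is only there to show that it is left out of the feature matrix. The method `to_numpy` is an alternative to wrapping the selection in `np.array`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for cars.csv (all values are made up).
toy = pd.DataFrame({
    'mpg':          [18.0, 15.0, 36.0],
    'cyl':          [8, 8, 4],
    'displacement': [307.0, 350.0, 79.0],
    'hp':           [130.0, 165.0, 58.0],
    'weight':       [3504.0, 3693.0, 1825.0],
    'acc':          [12.0, 11.5, 18.6],
    'year':         [70, 70, 77],
    'name':         ['car a', 'car b', 'car c'],  # non-numerical, not a feature
})

features = ['cyl', 'displacement', 'hp', 'weight', 'acc', 'year']
X_toy = toy[features].to_numpy(dtype=float)  # 2-D array: one row per car
Y_toy = toy['mpg'].to_numpy(dtype=float)     # 1-D array: one entry per car

print(X_toy.shape, Y_toy.shape)  # (3, 6) (3,)
```

The feature matrix has one row per car and one column per attribute, while the target is a flat vector with one entry per car.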
Let us inspect the first five rows of the matrix `X`.
In [ ]:
X[:5]
Since miles per gallon is the reciprocal of the fuel consumption (measured in gallons per mile), we replace `Y` by its elementwise reciprocal.
In [ ]:
Y = 1 / Y
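As a small worked example of this reciprocal relation, the sketch below converts a few made-up `mpg` values into gallons per mile. The additional conversion to litres per 100 km assumes that the data uses US gallons, which the notebook does not state.

```python
import numpy as np

mpg = np.array([10.0, 20.0, 40.0])  # made-up efficiency values

# Elementwise reciprocal: 0.1, 0.05 and 0.025 gallons per mile.
gallons_per_mile = 1 / mpg

# Assumption: US gallons (3.785411784 litres); 1 mile = 1.609344 km.
LITRES_PER_GALLON = 3.785411784
KM_PER_MILE = 1.609344
litres_per_100km = gallons_per_mile * LITRES_PER_GALLON / KM_PER_MILE * 100

print(gallons_per_mile)
print(litres_per_100km)
```

Doubling the efficiency halves the consumption, so a relationship that is linear in the consumption is not linear in `mpg`.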
We import the module `linear_model` from scikit-learn:
In [ ]:
import sklearn.linear_model as lm
We create a linear model.
In [ ]:
M = lm.LinearRegression()
We train this model using the data we have.
In [ ]:
M.fit(X, Y)
The model `M` represents a linear relationship between the dependent variable $1/\texttt{mpg}$ and the independent variables $\texttt{cyl}$, $\texttt{displacement}$, $\texttt{hp}$, $\texttt{weight}$, $\texttt{acc}$, and $\texttt{year}$ of the form
$$\displaystyle \frac{1}{\texttt{mpg}}
= \vartheta_0 + \vartheta_1 \cdot \texttt{cyl}
+ \vartheta_2 \cdot \texttt{displacement}
+ \vartheta_3 \cdot \texttt{hp}
+ \vartheta_4 \cdot \texttt{weight}
+ \vartheta_5 \cdot \texttt{acc}
+ \vartheta_6 \cdot \texttt{year}
$$
We proceed to extract the coefficients $\vartheta_i$ for $i\in\{0,\cdots,6\}$.
In [ ]:
ϑ0 = M.intercept_
ϑ0
In [ ]:
ϑ1, ϑ2, ϑ3, ϑ4, ϑ5, ϑ6 = M.coef_
ϑ1, ϑ2, ϑ3, ϑ4, ϑ5, ϑ6
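The following sketch checks, on synthetic data, that the intercept and the coefficients reproduce the predictions of a fitted `LinearRegression` model. The data, the seed, and the names `X_toy`, `Y_toy`, and `M_toy` are made up for illustration and have nothing to do with `cars.csv`.

```python
import numpy as np
import sklearn.linear_model as lm

# Synthetic data: 50 samples with 6 features, a known linear signal plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 6))
true_coef = np.array([1.0, -2.0, 0.5, 3.0, 0.0, -1.0])
Y_toy = 4.0 + X_toy @ true_coef + 0.1 * rng.normal(size=50)

M_toy = lm.LinearRegression()
M_toy.fit(X_toy, Y_toy)

# Evaluate intercept + sum_i coef_i * feature_i by hand for every sample.
by_hand = M_toy.intercept_ + X_toy @ M_toy.coef_

print(np.allclose(by_hand, M_toy.predict(X_toy)))  # True
```

The attribute `coef_` holds one entry per column of the feature matrix, in the same order as the columns, which is why the unpacking into six variables works.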
Let us check how much of the variance is explained by our model.
In [ ]:
R2 = M.score(X, Y)
R2
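For a regression model, `score` reports the coefficient of determination $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$, where RSS is the residual sum of squares and TSS is the total sum of squares around the mean. The sketch below computes this quantity by hand for a least-squares fit on synthetic data. It uses `np.linalg.lstsq` instead of scikit-learn, and all data and names are made up.

```python
import numpy as np

# Synthetic data: 40 samples, 2 features, linear signal plus noise.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(40, 2))
Y_toy = 1.0 + 2.0 * X_toy[:, 0] - X_toy[:, 1] + 0.5 * rng.normal(size=40)

# Least-squares fit with an intercept: prepend a column of ones.
A = np.column_stack([np.ones(len(X_toy)), X_toy])
theta, *_ = np.linalg.lstsq(A, Y_toy, rcond=None)
Y_hat = A @ theta

rss = np.sum((Y_toy - Y_hat) ** 2)         # residual sum of squares
tss = np.sum((Y_toy - Y_toy.mean()) ** 2)  # total sum of squares
r2 = 1 - rss / tss

print(r2)
```

A value of 1 means the model reproduces the data exactly, while a value of 0 means it does no better than always predicting the mean.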
The linear model explains $88\%$ of the variance of the fuel consumption $1/\texttt{mpg}$. To derive a better model, we would need both the reference area of the car and its drag coefficient.
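The reason these two quantities matter can be read off the standard formula for aerodynamic drag. The symbols below are not part of the data set: $\rho$ is the density of air, $c_d$ the drag coefficient, $A$ the reference area, and $v$ the speed.
$$\displaystyle F_{\mathrm{drag}} = \frac{1}{2} \cdot \rho \cdot c_d \cdot A \cdot v^2$$
The work needed to overcome drag over a distance $s$ is $F_{\mathrm{drag}} \cdot s$, so at a fixed speed the drag-related part of the fuel consumption per mile should be roughly proportional to the product $c_d \cdot A$. This is a simplified argument that ignores rolling resistance and engine efficiency.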
In [ ]: